Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Variant Discovery ◾ 111

sequencing, variant calling software that created the file, or the reference genome used for

determining variants. The first few lines of the metadata section describe the file format, file date,

source program, and the reference used. The metadata section also declares and describes

the fields provided at both the site level (INFO) and the sample level (FORMAT) in the data

lines of the data section. For example, the following metadata lines describe the ID, data

type, and description of some fields that can be found in the INFO and FORMAT columns

in the data section:

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
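To make the structure of these declarations concrete, the following is a minimal Python sketch that reads an INFO or FORMAT metadata line into a dictionary of its ID, Number, Type, and Description. The function name parse_meta_line and the two regular expressions are illustrative assumptions, not part of any VCF library, and the sketch assumes straight double quotes around the description.

```python
import re

# Matches structured metadata lines such as ##INFO=<...> and ##FORMAT=<...>.
META_RE = re.compile(r'^##(?P<kind>INFO|FORMAT)=<(?P<body>.*)>$')
# key=value pairs; a value is either a double-quoted string or a bare token.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta_line(line):
    """Return (kind, fields) for an INFO/FORMAT declaration, else None.

    Hypothetical helper for illustration only.
    """
    match = META_RE.match(line.strip())
    if match is None:
        return None  # e.g. ##fileformat=... or any other metadata line
    fields = {}
    for key, value in PAIR_RE.findall(match.group("body")):
        fields[key] = value.strip('"')  # drop the quotes around Description
    return match.group("kind"), fields


example = '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
kind, fields = parse_meta_line(example)
print(kind, fields["ID"], fields["Type"], fields["Description"])
```

Storing the declarations keyed by ID makes it possible to look up the declared data type when the INFO and FORMAT values of the data lines are decoded later.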

The data section begins with a single tab-delimited header line that has eight mandatory

fields representing columns for each data line (Table 4.1). The column headers of the data

section are as follows:

#CHROM POS ID REF ALT QUAL FILTER INFO

If genotype data are present, a FORMAT column is also declared, followed by one column per

sample, each headed by a unique sample name. These column names must be separated by tabs as well.

Each line in the data section represents a position in the genome. Its values correspond to the

columns specified in the header, must be separated by tabs, and must end with a newline. The

columns and their expected values are listed in Table 4.1. In all cases, missing values are

represented by a dot (“.”).
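The rules for the header line and the data lines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names parse_header and parse_record are hypothetical, the sample names NA00001 and NA00002 and most values in the data line are made up for the example, and real projects would normally rely on an established VCF parser.

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_header(line):
    """Validate the tab-delimited header line; return (columns, sample names)."""
    columns = line.rstrip("\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("the eight mandatory columns are missing or out of order")
    samples = []
    if len(columns) > 8:
        # Genotype data present: FORMAT must come next, then the sample names.
        if columns[8] != "FORMAT":
            raise ValueError("expected FORMAT as the ninth column")
        samples = columns[9:]
        if len(set(samples)) != len(samples):
            raise ValueError("sample names must be unique")
    return columns, samples


def parse_record(line, columns):
    """Map one data line onto the header columns; a dot becomes None."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(columns):
        raise ValueError("data line does not match the header")
    record = {
        name.lstrip("#"): (None if value == "." else value)
        for name, value in zip(columns, values)
    }
    record["POS"] = int(record["POS"])
    return record


# Illustrative header and data line (values are assumptions for the example).
header = "\t".join(FIXED_COLUMNS + ["FORMAT", "NA00001", "NA00002"])
columns, samples = parse_header(header)
record = parse_record("20\t14370\t.\tG\tA\t29\tPASS\tNS=3;DP=14\tGT:GQ\t0|0:48\t1|0:48\n", columns)
print(samples, record["CHROM"], record["POS"], record["ID"])
```

Converting the dot to None at read time keeps the missing-value convention in one place, so later code does not need to compare strings against ".".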

As shown in Figure 4.1, the variants are on chromosome 20 of the reference genome

NCBI36 (hg18). The figure shows five positions whose coordinates are 14370, 18330,

TABLE 4.1 VCF File Columns

Column #   Column   Description

1          #CHROM   A chromosome identifier (e.g., 11, chr11, X, or chrX)

2          POS      A reference position (sorted numerically in ascending order within each chromosome)

3          ID       Variant IDs separated by semicolons (no whitespace allowed)

4          REF      Reference base(s) (A, C, G, T, or N). For an insertion or deletion, the base preceding the event is included; a dot is not used

5          ALT      Comma-separated alternate allele(s) (A, C, G, T, or N). A dot indicates that there is no alternate allele

6          QUAL     A Phred-scaled quality score (a log scale)

7          FILTER   PASS if all filters passed, a semicolon-separated list of the filters that failed, or missing

8          INFO     Site-level information as semicolon-separated name=value pairs

9          FORMAT   Sample-level field names separated by colons (e.g., GT:GQ:DP)
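As a rough illustration of how the QUAL, INFO, and FORMAT columns are decoded, the Python sketch below splits an INFO string into name-value pairs, pairs the FORMAT field names with one sample's values, and converts a Phred-scaled QUAL score to an error probability using P = 10^(-Q/10). The helper names are hypothetical and the input strings are made-up examples. The sketch assumes that FORMAT and sample fields are colon-separated and that an INFO entry without "=" is a flag.

```python
def parse_info(info):
    """Split the INFO column into a dict; entries without '=' are flags (True)."""
    result = {}
    if info in (".", ""):
        return result  # missing INFO
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_sample(format_field, sample_field):
    """Pair colon-separated FORMAT names with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # A sample may omit trailing fields; treat those as missing.
    values += ["."] * (len(keys) - len(values))
    return {k: (None if v == "." else v) for k, v in zip(keys, values)}


def phred_to_error_probability(qual):
    """Convert a Phred-scaled score Q to the probability 10^(-Q/10)."""
    return 10 ** (-float(qual) / 10)


# Illustrative values only.
print(parse_info("NS=3;DP=14;AF=0.5;DB"))
print(parse_sample("GT:GQ:DP", "0|0:48:1"))
print(phred_to_error_probability(30))
```

Values are left as strings here; a fuller reader could use the Type declared in the metadata section (Integer, Float, String) to convert each one.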